A reasonably fast casefold implementation #122
Open
aneubeck wants to merge 9 commits into
Pull request overview
Adds a new casefold crate providing a compact, fast Unicode simple (1-to-1) case-folding implementation intended for use in case-insensitive indexing (notably alongside sparse-ngrams).
Changes:
- Introduces `casefold::simple_fold(char) -> char` backed by a generated paged-bitmap + packed-run table.
- Adds a build script that parses Unicode `CaseFolding.txt` to generate the compressed table at build time.
- Adds a dedicated `casefold-benchmarks` crate and wires it into the workspace.
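The overview names a paged bitmap but does not show the table layout. The following is only a toy sketch of one common way such a lookup can work (a per-page bitmap plus a popcount rank into a target array); it is not the table the build script generates. `ToyFold`, its fields, and the 64-code-point page size are all hypothetical.

```rust
/// Hypothetical toy table, NOT the layout generated by the PR's build script.
struct ToyFold {
    /// One u64 per 64-code-point page; bit i set means `page * 64 + i` has a fold.
    pages: Vec<u64>,
    /// For each page, the index in `targets` of its first mapped code point.
    offsets: Vec<u32>,
    /// Fold targets, ordered by source code point.
    targets: Vec<char>,
}

impl ToyFold {
    /// Builds the toy table from (source, target) pairs with unique sources.
    fn build(mut pairs: Vec<(char, char)>) -> Self {
        pairs.sort();
        let max = pairs.last().map(|p| p.0 as usize).unwrap_or(0);
        let mut pages = vec![0u64; max / 64 + 1];
        for (src, _) in &pairs {
            let cp = *src as usize;
            pages[cp / 64] |= 1u64 << (cp % 64);
        }
        // Prefix sums of set bits give each page's starting index in `targets`.
        let mut offsets = Vec::with_capacity(pages.len());
        let mut total = 0u32;
        for p in &pages {
            offsets.push(total);
            total += p.count_ones();
        }
        let targets = pairs.into_iter().map(|p| p.1).collect();
        ToyFold { pages, offsets, targets }
    }

    /// Returns the fold target, or `c` unchanged when no mapping exists.
    fn fold(&self, c: char) -> char {
        let cp = c as usize;
        let Some(&bits) = self.pages.get(cp / 64) else {
            return c;
        };
        let bit = cp % 64;
        if bits & (1u64 << bit) == 0 {
            return c;
        }
        // Rank: number of mapped code points below `c` within the same page.
        let rank = (bits & ((1u64 << bit) - 1)).count_ones();
        self.targets[(self.offsets[cp / 64] + rank) as usize]
    }
}
```

Built from a handful of pairs such as `('A', 'a')` and `('Σ', 'σ')`, `fold` returns the mapped character for listed sources and the input itself for everything else, which is the 1-to-1 behavior the overview describes for `simple_fold`.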
Show a summary per file
| File | Description |
|---|---|
| crates/casefold/src/lib.rs | Implements simple_fold and the paged-bitmap lookup logic plus correctness/size tests. |
| crates/casefold/README.md | Documents the encoding approach, table layout, and benchmark results. |
| crates/casefold/data/CaseFolding.txt | Vendors Unicode 16.0 CaseFolding.txt used for generating and testing the table. |
| crates/casefold/Cargo.toml | Declares the new casefold crate package metadata. |
| crates/casefold/build.rs | Build-time generator that parses CaseFolding.txt and emits the packed table into OUT_DIR. |
| crates/casefold/benchmarks/performance.rs | Criterion benchmark comparing the table implementation vs a HashMap baseline across workloads. |
| crates/casefold/benchmarks/lib.rs | Benchmark helper code for building the reference HashMap implementation. |
| crates/casefold/benchmarks/Cargo.toml | Declares the casefold-benchmarks crate and its dependencies. |
| Cargo.toml | Adds crates/casefold/benchmarks to the workspace members. |
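For orientation on what the build script has to extract from `CaseFolding.txt`: each data line has the form `code; status; mapping; # name`, and simple case folding uses the entries with status `C` (common) or `S` (simple). The function below is a minimal sketch of that filtering step; `parse_simple_fold` is a hypothetical name and is not taken from the PR's `build.rs`.

```rust
/// Parses one line of Unicode `CaseFolding.txt` and returns `(source, target)`
/// for simple foldings (status `C` or `S`); returns `None` for comments,
/// blank lines, and full (`F`) or Turkic (`T`) entries.
/// Hypothetical helper, not the PR's actual build-script code.
fn parse_simple_fold(line: &str) -> Option<(char, char)> {
    // Drop the trailing comment and surrounding whitespace.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }
    let mut fields = data.split(';').map(str::trim);
    let code = fields.next()?;
    let status = fields.next()?;
    let mapping = fields.next()?;
    if status != "C" && status != "S" {
        return None;
    }
    // Both fields are single hexadecimal code points for C and S entries.
    let src = char::from_u32(u32::from_str_radix(code, 16).ok()?)?;
    let dst = char::from_u32(u32::from_str_radix(mapping, 16).ok()?)?;
    Some((src, dst))
}
```

For example, `0041; C; 0061; # LATIN CAPITAL LETTER A` yields `('A', 'a')`, while the full folding `00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S` is skipped because it maps to more than one character.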
Copilot's findings
- Files reviewed: 9/9 changed files
- Comments generated: 8
Comment on lines +39 to +42

    .map(|_| {
        let b = rng.random_range(b'A'..=b'z');
        b as char
    })
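One property of the range in this snippet is easy to check: `b'A'..=b'z'` spans the ASCII letters plus the six bytes that sit between `Z` and `a`. Whether that matters for the benchmark workload is a judgment call for the author; the sketch below only enumerates those bytes. `non_letters_in_range` is a hypothetical helper, not part of the PR.

```rust
/// Returns the characters in `b'A'..=b'z'` that are not ASCII letters.
/// Hypothetical helper for illustration only.
fn non_letters_in_range() -> Vec<char> {
    (b'A'..=b'z')
        .filter(|b| !b.is_ascii_alphabetic())
        .map(char::from)
        .collect()
}
```

The result contains `[`, `\`, `]`, `^`, `_`, and the backtick, so a workload drawn from this range is mostly, but not purely, alphabetic.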
Comment on lines +161 to +166

    let ends: Vec<u32> = runs
        .iter()
        .map(|r| r.start + (r.length as u32 - 1) * (r.stride as u32))
        .collect();
    let last_covered = *ends.last().unwrap();
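The expression in this snippet computes the last code point a run covers as `start + (length - 1) * stride`. As written it would underflow for a run of length 0 and panic on an empty `runs` list; that may be impossible for the generated data, but a defensive variant can be sketched as follows. The `Run` struct and its field types here are assumptions inferred from the casts in the snippet, not the PR's actual definition, and `run_end` and `last_covered` are hypothetical names.

```rust
/// Hypothetical run descriptor; field types are assumed, not taken from the PR.
struct Run {
    start: u32,
    length: u16,
    stride: u8,
}

/// Last code point covered by a run, or `None` if the run is empty
/// or the arithmetic would overflow.
fn run_end(r: &Run) -> Option<u32> {
    let steps = (r.length as u32).checked_sub(1)?;
    r.start.checked_add(steps.checked_mul(r.stride as u32)?)
}

/// End of the final run in a list, or `None` if the list is empty.
fn last_covered(runs: &[Run]) -> Option<u32> {
    runs.last().and_then(run_end)
}
```

A run starting at `0x41` with length 26 and stride 1 ends at `0x5A`, and a run starting at `0x100` with length 3 and stride 2 ends at `0x104`, matching the formula in the reviewed lines.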
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
The casefold crate is useful in combination with sparse-ngrams indexing.